Many factors influence an individual's health, such as physical exercise, sleep, nutrition, heredity and pollution. As nutrition is one of the biggest modifiable factors in our lives, small changes can have a big impact. With the exponential increase in the number of available food options, it is no longer possible to take them all into account. The only way to consider user taste preferences, maximize the number of healthy compounds and minimize the unhealthy ones in food is to use (3D) recommendation systems.
The goal of this project was to use the largest publicly available collection of recipe data (Recipe1M+) to build a recommendation system for ingredients and recipes; to train, evaluate and test a model able to predict cuisines from sets of ingredients; to estimate the probability of negative recipe-drug interactions based on the predicted cuisine; and, finally, to build a web application as a step towards a 3D recommendation system.
A vectorial representation of every ingredient and recipe was generated using Word2Vec. An SVC model was trained to return recipes' cuisines from their sets of ingredients. South Asian, East Asian and North American cuisines were predicted with more than 73% accuracy. African, Southern European and Middle Eastern cuisines contain the highest number of cancer-beating molecules. Finally, a web application was developed that is able to predict the ingredients from an image, suggest new combinations and retrieve the cuisine a recipe belongs to, along with a score for the expected number of negative interactions with antineoplastic drugs (github.com/warcraft12321/HyperFoods).
Importing libraries installed using PyPI and functions from scripts created for this project.
# ---------------------------- Data Management ----------------------------
# pandas is an open source library providing high-performance, easy-to-use data structures and data
# analysis tools for the Python programming language.
import pandas
# ---------------------------- Scientific Operations ----------------------------
# NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful
# N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code,
# useful linear algebra, Fourier transform, and random number capabilities.
import numpy
# ---------------------------- Write & Read JSON Files ----------------------------
# Python has a built-in package which can be used to work with JSON data.
import json
# ---------------------------- Pickling ----------------------------
# The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling”
# is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse
# operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
import pickle
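As a minimal illustration of the round trip described above (the vocabulary list here is a hypothetical stand-in for the contents of the project's `ingr_vocab.pkl`):

```python
import pickle

# Toy stand-in for the ingredient vocabulary stored in ingr_vocab.pkl (hypothetical content).
ingr_vocab = ["olive_oil", "garlic", "tomato"]

# Pickling: Python object hierarchy -> byte stream.
data = pickle.dumps(ingr_vocab)

# Unpickling: byte stream -> Python object hierarchy.
restored = pickle.loads(data)
print(restored == ingr_vocab)  # True
```

The same `pickle.load`/`pickle.dump` pair works on binary file objects, which is how the vocabulary file is read later.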
# ------------------------------------- Word2Vec -------------------------------------
# Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural
# networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of
# text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being
# assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that
# share common contexts in the corpus are located close to one another in the space.
# Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target
# audience is the natural language processing (NLP) and information retrieval (IR) community.
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
# -------------------------- Dimensionality Reduction Tools --------------------------
# Scikit-learn (also known as sklearn) is a free software machine learning library for the
# Python programming language. It features various classification, regression and clustering algorithms including
# support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with
# the Python numerical and scientific libraries NumPy and SciPy.
# Principal component analysis (PCA) - Linear dimensionality reduction using Singular Value Decomposition of the data to
# project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying
# the SVD.
# t-distributed Stochastic Neighbor Embedding (t-SNE) - It is a tool to visualize high-dimensional data. It converts
# similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between
# the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that
# is not convex, i.e. with different initializations we can get different results.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# ------------------------------ Check File Existence -------------------------------
# The main purpose of the os module is to interact with the operating system. Here, os.path is used to check
# whether files already exist before recomputing them.
from os import path
# ------------------------ Designed Visualization Functions -------------------------
# Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats
# and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython
# shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
# Plotly's Python graphing library makes interactive, publication-quality graphs. You can use it to make line plots,
# scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar
# charts, and bubble charts.
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing
# attractive and informative statistical graphics.
from algorithms.view.matplotlib_designed import matplotlib_function
from algorithms.view.plotly_designed import plotly_function
from algorithms.view.seaborn_designed import seaborn_function
import matplotlib.pyplot as pyplot
import matplotlib as mpl
import matplotlib.cm as cm
import seaborn
# ------------------------ Retrieving Ingredients, Units and Quantities -------------------------
from algorithms.parsing.ingredient_quantities import ingredient_quantities
# ------------------------ Correcting Recipe1M+ Dataset -------------------------
from algorithms.fractions.correct_fractions_recipe1M import corrector
# ------------------------ Creating Vocabulary Units -------------------------
from algorithms.vocabulary.units import create_unit_vocab
# ------------------------ Create Distance Matrix -------------------------
# SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for
# optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE
# solvers and other tasks common in science and engineering.
# distance_matrix returns the matrix of all pair-wise distances.
from scipy.spatial import distance_matrix
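A quick sketch of what `distance_matrix` returns, using toy 2-D points as stand-ins for the embedded ingredient vectors used later:

```python
import numpy
from scipy.spatial import distance_matrix

# Three 2-D points (toy stand-ins for embedded ingredient vectors).
points = numpy.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])

# distance_matrix returns all pair-wise Euclidean distances:
# entry (i, j) is ||points[i] - points[j]||.
d = distance_matrix(points, points)
print(d[0, 1])  # 5.0 (the 3-4-5 triangle)
```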
# ------------------------ Unsupervised Learning -------------------------
# Infomap detects communities in large networks with the map equation framework.
from clustering.infomapAlgorithm import infomap_function
from sklearn.cluster import DBSCAN, MeanShift # DBSCAN, Meanshift
import community # Louvain
from sklearn.cluster import SpectralClustering
# ------------------------ Supervised Learning -------------------------
# GridSearchCV to choose the right tuning parameters. confusion_matrix used to evaluate the model.
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold, LeaveOneOut
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
# ------------------------ Jupyter Notebook Widgets -------------------------
# Interactive HTML widgets for Jupyter notebooks and the IPython kernel.
import ipywidgets as w
from IPython.core.display import display
from IPython.display import Image
# ------------------------ IoU Score -------------------------
# The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the
# French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity
# of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of
# the intersection divided by the size of the union of the sample sets.
# Function implemented during this project.
from benchmark.iou_designed import iou_function
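`iou_function` was implemented for this project; a minimal pure-Python version of the same statistic (with hypothetical ingredient sets) could look like:

```python
# Jaccard index / Intersection over Union between two finite sets: |A ∩ B| / |A ∪ B|.
def iou(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical predicted and ground-truth ingredient sets.
predicted = {"onion", "garlic", "tomato"}
actual = {"onion", "garlic", "basil"}
print(iou(predicted, actual))  # 2 shared / 4 total = 0.5
```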
# ------------------------ F1 Score -------------------------
# The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best
# value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The
# formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall)
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
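The formula above can be checked directly with scikit-learn on toy binary labels (hypothetical values):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary labels to verify F1 = 2 * (precision * recall) / (precision + recall).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)  # 2 true positives / 3 predicted positives
r = recall_score(y_true, y_pred)     # 2 true positives / 3 actual positives
f1 = f1_score(y_true, y_pred)
print(abs(f1 - 2 * p * r / (p + r)) < 1e-12)  # True
```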
# ------------------------ API Requests -------------------------
# The requests library is the de facto standard for making HTTP requests in Python. It abstracts the complexities of
# making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data
# in your application.
import requests
# ------------------------ RegEx -------------------------
# A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
# RegEx can be used to check if a string contains the specified search pattern.
# Python has a built-in package called re, which can be used to work with Regular Expressions.
import re
# ------------------------ Inflect -------------------------
# Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words.
import inflect
# ------------------------ Parse URLs -------------------------
# This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing
# scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL”
# to an absolute URL given a “base URL.”
from urllib.parse import urlparse
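A short sketch of the scheme/netloc extraction used later to list the websites recipes were scraped from (the URL here is a hypothetical example):

```python
from urllib.parse import urlparse

# Break a URL into components and rebuild just the scheme://netloc base.
parsed = urlparse("https://www.example.com/recipe/123?ref=home")
base = '{uri.scheme}://{uri.netloc}'.format(uri=parsed)
print(base)  # https://www.example.com
```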
# ------------------------ Embedding HTML -------------------------
# Public API for display tools in IPython.
from IPython.display import HTML
# ------------------------ Creating Graph -------------------------
# NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of
# complex networks.
import networkx
# ------------------------ Language Detectors -------------------------
# TextBlob requires an API connection to Google's translation service (with a low limit on the number of requests). langdetect is an offline detector.
from textblob import TextBlob
from langdetect import detect
# ------------------------ Punctuation -------------------------
# In Python, string.punctuation gives the full set of punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
import string
# ------------------------ CSV Reader -------------------------
# CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.
import csv
# ------------------------ Natural Language Processing -------------------------
# NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to
# over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification,
# tokenization, stemming, tagging, parsing, semantic reasoning and wrappers for industrial-strength NLP libraries.
# stopwords returns all the stopwords in the English language (e.g. and, to, for...).
# wordnet allows the user to test whether a word belongs to the class of verbs, nouns, adjectives (...).
# WordNetLemmatizer lemmatizes a given word by transforming it into a more basic form of itself (e.g. better -> good, nouns -> noun).
# word_tokenize tokenizes an input string, separating words from one another and from punctuation.
# webcolors returns a set of the most used color names.
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import webcolors
# ------------------------ Parallel Code Execution -------------------------
# joblib provides a simple helper class to write parallel for loops using multiprocessing.
# multiprocessing detects the number of cores in the executing machine.
from joblib import Parallel, delayed
import multiprocessing
# ------------------------ Others -------------------------
import math
import ast
import random
%load_ext autoreload
%%time
# This was the dataset chosen for all the data analysis.
# After testing different Python modules (json and simplejson) to load the Recipe1M+ JSON file, both showed similar performance.
# ---------------------------- Importing Recipe1M+ Dataset ----------------------------
f_recipe1M = open('./data/recipe1M+/noEmptyIngredientsOrInstructions/noEmptyIngredientOrInstructionRecipes/fractionsCorrected/fractionsCorrected.json')
recipes_data = json.load(f_recipe1M)  # Append [0:100000] to work on a subset; a regular computer is able to read the full Recipe1M+ dataset.
f_recipe1M.close()
id_ingredients = {}
id_url = {}
id_tea = {}
id_salad = {}
lemmatizer = WordNetLemmatizer()
for recipe in recipes_data:
    id_ingredients[recipe["id"]] = []
    id_url[recipe["id"]] = recipe["url"]
    title_tokenized_lemmatized = word_tokenize(recipe["title"])
    title_tokenized_lemmatized = [lemmatizer.lemmatize(w.lower()) for w in title_tokenized_lemmatized]
    if "tea" in title_tokenized_lemmatized:
        id_tea[recipe["id"]] = recipe["title"]
    if "salad" in title_tokenized_lemmatized:
        id_salad[recipe["id"]] = recipe["title"]
    for index, ingredient in enumerate(recipe["ingredients"]):
        id_ingredients[recipe["id"]].append({"id": index, "ingredient": ingredient["text"].lower()})
print(list(id_tea.keys())[0:20])
print(list(id_tea.values())[0:20])
# Online websites parsed to retrieve recipes.
if not path.exists("./data/recipe1M+/allRecipeDatabases.txt"):
    recipe_databases = []
    for key, value in id_url.items():
        parsed_uri = urlparse(value)
        result = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)
        recipe_databases.append(result)
    # The common approach to get a unique collection of items is to use a set. Sets are unordered collections of
    # distinct objects. To create a set from any iterable, pass it to the built-in set() function; if you later
    # need a real list again, pass the set to the list() function.
    recipe_databases_list = list(set(recipe_databases))
else:
    recipe_databases_list = []
    f_urls = open("./data/recipe1M+/allRecipeDatabases.txt", "r")
    for x in f_urls:
        recipe_databases_list.append(x.replace("\n", ""))
    f_urls.close()
print(recipe_databases_list)
with open('./data/recipe1M+/allRecipeDatabases.txt', 'w') as f:
    for key, item in enumerate(recipe_databases_list):
        if item != "" and key < len(recipe_databases_list) - 1:
            f.write("%s\n" % item)
        elif item != "":
            f.write("%s" % item)
# Dataset used to train the Support Vector Classifier. It contains the set of ingredients and the associated cuisine for each recipe.
f_kaggleNature = open('./data/kaggle_and_nature/kaggle_and_nature.csv', newline = '')
game_reader = csv.reader(f_kaggleNature, delimiter='\t')
id_ingredients_cuisine = []
cuisines = []
i = 0
for game in game_reader:
    id_ingredients_cuisine.append({"id": i, "ingredients": [ingredient.replace("_", " ") for ingredient in game[0].split(",")[1:]], "cuisine": game[0].split(",")[0]})
    cuisines.append(game[0].split(",")[0])
    i = i + 1
# Removing Punctuation and Stopwords, and Performing Lemmatization
if not path.exists("./data/kaggle_and_nature/new_id_ingredients_cuisine.json"):
    stop_words = set(stopwords.words('english'))
    intab = '''!()-[]{};:'"\,<>?@#$%^&*_~'''
    outtab = "_" * len(intab)
    trantab = str.maketrans(intab, outtab)
    lemmatizer = WordNetLemmatizer()
    new_id_ingredients_cuisine = []
    max_number_ingredients = 0
    set_ingredients = set()
    for recipe_number, recipe in enumerate(id_ingredients_cuisine):
        ingredients_aux = []
        for ingredient in recipe["ingredients"]:
            word_tokens = word_tokenize(ingredient)
            ingredient_modified = " ".join([lemmatizer.lemmatize(w.translate(trantab).replace("_", "").lower()) for w in word_tokens if w not in stop_words])
            ingredients_aux.append(ingredient_modified)
            set_ingredients.add(ingredient_modified)
        new_id_ingredients_cuisine.append({"id": recipe_number, "ingredients": sorted(ingredients_aux), "cuisine": recipe["cuisine"]})
        if max_number_ingredients < len(ingredients_aux):
            max_number_ingredients = len(ingredients_aux)
else:
    f = open('./data/kaggle_and_nature/new_id_ingredients_cuisine.json')
    new_id_ingredients_cuisine = json.load(f)
    f.close()
# ---------------------------- Saving to JSON File ----------------------------
with open('./data/kaggle_and_nature/new_id_ingredients_cuisine.json', 'w') as json_file:
    json.dump(new_id_ingredients_cuisine, json_file)
Getting the anticancer ingredients and the number of anticancer molecules each one contains. Further data processing to facilitate analysis.
ac_data = pandas.read_csv("./data/anticancer/food_compound_simplified.csv", delimiter = ",")
ac_data.head()
# Selecting Useful Anti-Cancer Ingredients Columns
ac_data_mod = ac_data[['Common Name', 'Number of CBMs']]
ac_data_mod
# Dropping Nan Rows from Anti-Cancer Ingredients Table
ac_data_mod = ac_data_mod.replace("", numpy.nan)
ac_data_mod = ac_data_mod.dropna()
ac_data_mod
# Converting DataFrame to Dictionary
ingredient_anticancer = {}
for index, row in ac_data_mod.iterrows():
    ingredient_anticancer[row['Common Name'].lower()] = row['Number of CBMs']
ingredients_list = []
with open('./vocabulary/ingredients_vocabulary/ingr_vocab.pkl', 'rb') as f:  # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f)
new_ingredients_list = []  # List of ingredients from the vocabulary with spaces instead of underscores.
for ingredient_vocab in ingredients_list:
    if "_" in ingredient_vocab:
        new_ingredients_list.append(ingredient_vocab.replace("_", " "))
        continue
    new_ingredients_list.append(ingredient_vocab)  # In case there is no "_".
# ---------------------------- Optimizing Ingredients Vocabulary ----------------------------
if not path.exists("./vocabulary/ingredients_vocabulary/ingr_vocab_lemmatized_stopwords.txt"):
    stop_words = set(stopwords.words('english'))
    intab = '''!()-[]{};:'"\,<>?@#$%^&*_~'''
    outtab = "_" * len(intab)
    trantab = str.maketrans(intab, outtab)
    lemmatizer = WordNetLemmatizer()
    modified_ingredient_mediumSize = []
    for key, value in enumerate(new_ingredients_list):
        word_tokens = word_tokenize(value)
        modified_ingredient_mediumSize.append(" ".join([lemmatizer.lemmatize(w.translate(trantab).replace("_", "").lower()) for w in word_tokens if w not in stop_words]))
    with open('./vocabulary/ingredients_vocabulary/ingr_vocab_lemmatized_stopwords.txt', 'w') as f:
        for key, item in enumerate(modified_ingredient_mediumSize):
            if item != "" and key < len(modified_ingredient_mediumSize) - 1:
                f.write("%s\n" % item)
            elif item != "":
                f.write("%s" % item)
else:
    # After some manual adjustments.
    new_ingredients_list = []
    f_ingredients = open("./vocabulary/ingredients_vocabulary/ingr_vocab_lemmatized_stopwords.txt", "r")
    for x in f_ingredients:
        new_ingredients_list.append(x.replace("\n", ""))
    f_ingredients.close()
create_unit_vocab(recipes_data, new_ingredients_list)
# ---------------------------- Optimizing Units Vocabulary ----------------------------
if not path.exists("./vocabulary/units_vocabulary/units_list_lemmatized_stopwords.txt"):
    units_list = []
    f_units = open("./vocabulary/units_vocabulary/units_list_final_filtered_lemmatized.txt", "r")
    for x in f_units:
        units_list.append(x.replace("\n", ""))
    f_units.close()
    new_units_list = []
    for key, value in enumerate(units_list):
        new_units_list.append(" ".join([lemmatizer.lemmatize(w.translate(trantab).replace("_", "").lower()) for w in value.split(" ") if w not in stop_words]))
    with open('./vocabulary/units_vocabulary/units_list_lemmatized_stopwords.txt', 'w') as f:
        for key, item in enumerate(new_units_list):
            if item != "" and key < len(new_units_list) - 1:
                f.write("%s\n" % item)
            elif item != "":
                f.write("%s" % item)
else:
    new_units_list = []
    f_units = open("./vocabulary/units_vocabulary/units_list_lemmatized_stopwords.txt", "r")
    for x in f_units:
        new_units_list.append(x.replace("\n", ""))
    f_units.close()
# Optimized for Recipe1M+.
corrector(recipes_data, new_ingredients_list)
%%time
ingredients_vocab = []
units_vocab = []
f = open("./vocabulary/ingredients_vocabulary/ingr_vocab_lemmatized_stopwords.txt", "r")
for x in f:
    ingredients_vocab.append(x.replace("\n", ""))
f.close()
def order(e):
    return len(e)
f = open('./vocabulary/units_vocabulary/units_list_lemmatized_stopwords.txt', "r")
for x in f:
    units_vocab.append(x.replace("\n", ""))
f.close()
print(ingredient_quantities("1/2 teaspoon rose petal", ingredients_vocab, units_vocab))
'''
def auxx11(recipe):
    for ingredient in recipe["ingredients"]:
        print(ingredient_quantities(ingredient["text"], ingredients_vocab, units_vocab))
    return ingredient_quantities(ingredient["text"], ingredients_vocab, units_vocab)

num_cores = multiprocessing.cpu_count()
out_output = Parallel(n_jobs=num_cores)(delayed(auxx11)(recipe) for recipe in recipes_data)  # [0:100000]
'''
if not path.exists("./id_listIngredients.json"):
    f = open('./id_quantities_units_ingredients_grams.json')
    id_quantities_units_ingredients_grams = json.load(f)  # [0:100]
    f.close()
    id_listIngredients = {}
    max_number_ingredients = 0
    for recipe_id, recipe in id_quantities_units_ingredients_grams.items():
        id_listIngredients[recipe_id] = []
        for ingredients in recipe:
            for ingredient in ingredients:
                id_listIngredients[recipe_id].append(ingredient["ingredient"])
        id_listIngredients[recipe_id] = sorted(set(id_listIngredients[recipe_id]))
        if max_number_ingredients < len(id_listIngredients[recipe_id]):
            max_number_ingredients = len(id_listIngredients[recipe_id])
else:
    f = open('./id_listIngredients.json')
    id_listIngredients = json.load(f)  # [0:100]
    f.close()
with open('./id_listIngredients.json', 'w') as json_file:
    json.dump(id_listIngredients, json_file)
if not path.exists("./trained_models/word2vec.bin"):
    f = open('./data/kaggle_and_nature/new_id_ingredients_cuisine.json')
    new_id_ingredients_cuisine = json.load(f)  # [0:100]
    f.close()
    # ---------------------------- Recipe1M+ Dataset ----------------------------
    corpus1 = list(id_listIngredients.values())
    # ---------------------------- Kaggle&Nature Dataset ----------------------------
    corpus2 = []
    for key, recipe in enumerate(new_id_ingredients_cuisine):
        corpus2.append(recipe["ingredients"])
    # ---------------------------- Generating Word2Vec Model ----------------------------
    model = Word2Vec(corpus1 + corpus2, min_count=1, size=100, workers=8, window=65, sg=0)
    # By default, the model is saved in a binary format to save space.
    model.save("word2vec.model")
    # Save the learned model in ASCII format and review the contents.
    model.wv.save_word2vec_format('./trained_models/word2vec.txt', binary=False)
else:
    model = Word2Vec.load("word2vec.model")
PCA and t-SNE are used to reduce the dimensionality of the vectors representing ingredients so that they can be plotted in two dimensions.
# X_ingredients_recipe1MKaggle = model[model.wv.vocab]
X_ingredients_recipe1M = [model[w] for w in model.wv.vocab if w in new_ingredients_list] # Not including Kaggle&Nature ingredient vectors.
X_ingredients_recipe1M_keys = [w for w in model.wv.vocab if w in new_ingredients_list] # Not including Kaggle&Nature ingredient vocabularies.
# ---------------------------- PCA ----------------------------
X_ingredients_embedded1 = PCA(n_components=2).fit_transform(X_ingredients_recipe1M)
# ---------------------------- T-SNE ----------------------------
X_ingredients_embedded2 = TSNE(n_components=2).fit_transform(X_ingredients_recipe1M)
Finding groups of ingredients that most often co-occur in the same recipes.
# ---------------------------- Build Distance Dataframe & Networkx Graph ----------------------------
data = X_ingredients_embedded1 # list(X_ingredients_embedded1) / model[model.wv.vocab]
ctys = X_ingredients_recipe1M_keys # list(model.wv.vocab)
df = pandas.DataFrame(data, index=ctys)
# distances = (pandas.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)).rdiv(1) # Creating dataframe from distance matrix between ingredient vectors.
#G = networkx.from_pandas_adjacency(distances) # Creating networkx graph from pandas dataframe.
# X = numpy.array(df.values) # Creating numpy array from pandas dataframe.
# ---------------------------- Clustering ----------------------------
# Spectral Clustering
ingredientModule = SpectralClustering(n_clusters=9, assign_labels="discretize", random_state=0, n_jobs=-1).fit_predict(data)
# Mean Shift
# ingredientModule = MeanShift(n_jobs=-1, cluster_all = True).fit(X).labels_
# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
# ingredientModule = DBSCAN(eps=0.3, min_samples=2, n_jobs=-1).fit(X).labels_ # Noisy samples are given the label -1.
# Louvain
#ingredientModule = list((community.best_partition(G)).values())
# Infomap
# ingredientModule = infomap_function(distances, ctys)
Retrieving how often different ingredients are used across the recipe dataset.
# ------------------------------- Counting Number of Ingredient Occurrences across Recipe1M+ -------------------------------
ingredients_count = {}
for recipe_id, ingredients in id_listIngredients.items():
    for ingredient in ingredients:
        if ingredient in ingredients_count.keys():
            ingredients_count[ingredient] = ingredients_count[ingredient] + 1
        else:
            ingredients_count[ingredient] = 1
Retrieving nutritional information for each ingredient present in the Recipe1M+ vocabulary.
The overall recipe score will be calculated taking into account not only the number of cancer-beating molecules, but also the nutritional content.
Data Source: U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.
get_nutritional_content(new_ingredients_list) # new_ingredients_list should contain all the ingredients lemmatized, without punctuation and with no stop words.
threshold = 2500 # Hiding under-represented ingredients in the dataset.
indices = [index for index, size in enumerate(list(ingredients_count.values())) if size > threshold]
input1 = numpy.array([size for index, size in enumerate(X_ingredients_embedded1) if index in indices])
input2 = numpy.array([size for index, size in enumerate(X_ingredients_embedded2) if index in indices])
input3 = [size for index, size in enumerate(list(ingredients_count.keys())) if index in indices]
input4 = [size for index, size in enumerate(ingredientModule) if index in indices]
input5 = [size for index, size in enumerate(list(ingredients_count.values())) if index in indices]
input4_anticancer = []
for ingredient in input3:
    i = 0
    for antic_molecule in list(ingredient_anticancer.keys()):
        i = i + 1
        if antic_molecule in ingredient.split(" "):
            input4_anticancer.append("green")
            break
        elif len(antic_molecule.split(" ")) > 1 and antic_molecule in ingredient:
            input4_anticancer.append("green")
            break
        elif i == len(list(ingredient_anticancer.keys())):
            input4_anticancer.append("black")
%autoreload 2
matplotlib_function(input1, input2, input3, input4_anticancer, input5, "Ingredients", True)
plotly_function(input1, input2, input3, input4, input5, "false", "Ingredients")
seaborn_function(X_ingredients_embedded1, X_ingredients_embedded2, list(ingredients_count.keys()), ingredientModule, list(ingredients_count.values()))  # Occurrence counts used as marker sizes.
Representing recipes in vectorized form by taking the average of the vectors of the ingredients present. The number of ingredients in each recipe is also computed so that the size of each recipe marker can be made proportional to it.
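The averaging performed below can be sketched with toy vectors (hypothetical 3-dimensional embeddings standing in for the 100-dimensional Word2Vec ones):

```python
import numpy

# Hypothetical 3-dimensional embeddings for the ingredients of one recipe.
ingredient_vectors = [
    numpy.array([1.0, 0.0, 2.0]),  # e.g. "onion"
    numpy.array([3.0, 2.0, 0.0]),  # e.g. "garlic"
]

# The recipe vector is the element-wise average of its ingredient vectors.
recipe_vector = numpy.sum(numpy.array(ingredient_vectors), 0) / len(ingredient_vectors)
print(recipe_vector)  # [2. 1. 1.]
```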
%%time
f = open('./id_quantities_units_ingredients_grams.json')
final_grams = (json.load(f))# [0:100]
f.close()
recipe_weight = {}
countingIngredients = {}
# ---------------------------- Calculating total weight for each recipe ----------------------------
for recipe_id, recipe in final_grams.items():
    recipe_weight[recipe_id] = 0
    for ingredients in recipe:
        for ingredient in ingredients:
            if ingredient["quantity"]:
                # recipe_weight[recipe_id] = recipe_weight[recipe_id] + ingredient["quantity"]
                recipe_weight[recipe_id] = recipe_weight[recipe_id] + 1
    if recipe_weight[recipe_id] == 0:
        recipe_weight[recipe_id] = 1
# ---------------------------- Weighting the Contribution of each Ingredient ----------------------------
recipe_vector = {}
for recipe_id, recipe in final_grams.items():
    countingIngredients[recipe_id] = 0
    recipe_vector[recipe_id] = []
    for ingredients in recipe:
        if ingredients:
            for ingredient in ingredients:
                if not isinstance(ingredient, float) and ingredient["quantity"] != 0:
                    countingIngredients[recipe_id] = float(countingIngredients[recipe_id]) + 1
                    # recipe_vector[recipe_id].append(numpy.array(model[ingredient["ingredient"]]) * float(ingredient["quantity"]))
                    recipe_vector[recipe_id].append(numpy.array(model[ingredient["ingredient"]]) * 1)
    recipe_vector[recipe_id] = numpy.sum(numpy.array(recipe_vector[recipe_id]), 0) / recipe_weight[recipe_id]
new_list = {}
new_countingIngredients = {}
i = 0
for recipe_id, listy in recipe_vector.items():
    if not isinstance(listy, numpy.ndarray):
        i = i + 1
    else:
        new_list[recipe_id] = listy
        new_countingIngredients[recipe_id] = countingIngredients[recipe_id]
PCA and t-SNE are used to reduce the dimensionality of the vectors representing recipes so that they can be plotted in two dimensions. Although some information is inevitably lost, the pair of most variable components was used.
inputt = numpy.array(list(new_list.values())[0:10000]) # Choosing number of recipes under consideration.
print(inputt.shape)
# ---------------------------- PCA ----------------------------
X_recipes_embedded1 = PCA(n_components=2).fit_transform(inputt)
# ---------------------------- T-SNE ----------------------------
X_recipes_embedded2 = TSNE(n_components=2).fit_transform(inputt)
Finding groups of recipes that most correspond to different types of cuisine.
# ---------------------------- Build Distance Dataframe & Networkx Graph ----------------------------
data = list(X_recipes_embedded1) # list(X_recipes_embedded1) / id_recipe.values()
ctys = list(new_list.keys())[0:len(data)]
df = pandas.DataFrame(data, index=ctys)
# distances = (pandas.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)).rdiv(1)
# G = networkx.from_pandas_adjacency(distances) # Creating networkx graph from pandas dataframe.
# X = numpy.array(df.values) # Creating numpy array from pandas dataframe.
# ---------------------------- Clustering ----------------------------
recipeModules = SpectralClustering(assign_labels = "discretize", random_state = 0, n_jobs = -1).fit_predict(data)
# Mean Shift
# recipeModules = MeanShift(n_jobs=-1, cluster_all = True).fit(X).labels_
# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
# recipeModules = DBSCAN(eps=0.3, min_samples=2, n_jobs=-1).fit(X).labels_ # Noisy samples are given the label -1.
# Louvain
# recipeModules = list((community.best_partition(G)).values())
# Infomap
# recipeModules = infomap_function(1./distances, ctys)
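The clustering call above (and the commented-out alternatives) all reduce to the same pattern: a label per embedded point. A minimal illustration on toy 2-D data, with cluster positions chosen here purely for the example:

```python
# Minimal illustration of the clustering step: SpectralClustering assigns one
# integer module label to every 2-D embedding. Two well-separated blobs.
import numpy
from sklearn.cluster import SpectralClustering

points = numpy.vstack([
    numpy.random.RandomState(0).normal(0, 0.1, (20, 2)),  # blob near (0, 0)
    numpy.random.RandomState(1).normal(5, 0.1, (20, 2)),  # blob near (5, 5)
])
labels = SpectralClustering(n_clusters=2, assign_labels="discretize",
                            random_state=0).fit_predict(points)
print(labels.shape)  # one label per point
```

Swapping in MeanShift, DBSCAN, Louvain, or Infomap (as in the commented lines) changes only how `recipeModules` is produced, not how it is consumed downstream.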
Calculating the score of each recipe taking into account the number of cancer-beating molecules.
Data Source: Veselkov, K., Gonzalez, G., Aljifri, S. et al. HyperFoods: Machine intelligent mapping of cancer-beating molecules in foods. Sci Rep 9, 9237 (2019) doi:10.1038/s41598-019-45349-y
if not path.exists("./recipe_cancerscore.json"):
    recipe_cancerscore = {}
    # ---------------------------- Weighting Number of Anti-Cancer Molecules Present in each Recipe ----------------------------
    ingredient_anticancer_keys = list(ingredient_anticancer.keys())
    j = 0
    for recipe_id, recipe in final_grams.items():
        # if j % 10000 == 0:
        #     print("hi")
        recipe_cancerscore[recipe_id] = 0
        for ingredients in recipe:
            for ingredient in ingredients:
                i = 0
                aux = True    # first pass: exact name matches
                aux2 = False  # second pass: substring matches
                while i < len(ingredient_anticancer_keys):
                    if aux and ingredient["ingredient"] == ingredient_anticancer_keys[i]:
                        # recipe_cancerscore[recipe_id] = recipe_cancerscore[recipe_id] + ingredient_anticancer[ingredient_anticancer_keys[i]]*(ingredient["quantity"])/(recipe_weight[recipe_id])
                        recipe_cancerscore[recipe_id] = recipe_cancerscore[recipe_id] + ingredient_anticancer[ingredient_anticancer_keys[i]] * 1 / (recipe_weight[recipe_id])
                        break
                    elif aux2 and ingredient["ingredient"] in ingredient_anticancer_keys[i]:
                        # recipe_cancerscore[recipe_id] = recipe_cancerscore[recipe_id] + ingredient_anticancer[ingredient_anticancer_keys[i]]*(ingredient["quantity"])/(recipe_weight[recipe_id])
                        recipe_cancerscore[recipe_id] = recipe_cancerscore[recipe_id] + ingredient_anticancer[ingredient_anticancer_keys[i]] * 1 / (recipe_weight[recipe_id])
                        break
                    elif i == len(ingredient_anticancer_keys) - 1 and aux and not aux2:
                        i = -1  # restart the scan for the substring pass (the increment below makes i == 0 again)
                        aux = False
                        aux2 = True
                    i = i + 1
        j = j + 1
else:
    f_kaggleNature = open('./recipes_anticancerMolecules.csv', newline='')
    game_reader = csv.reader(f_kaggleNature, delimiter='\t')
    recipe_cancerscore = {}
    for game in game_reader:
        if game[0].split(",")[0] != "id":
            recipe_cancerscore[game[0].split(",")[0]] = float(game[0].split(",")[1])
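In essence, the loop above computes, per recipe, the sum of each matched ingredient's anticancer-molecule count divided by the recipe weight, trying exact name matches first and falling back to substring matches. A minimal sketch with made-up data (the `ingredient_anticancer` values here are illustrative only):

```python
# Hypothetical sketch of the per-recipe cancer score: exact name matches are
# tried first, substring matches are a fallback. All data below is illustrative.
ingredient_anticancer = {"garlic": 5, "green_tea": 8}

def cancer_score(ingredient_names, recipe_weight):
    score = 0.0
    for name in ingredient_names:
        if name in ingredient_anticancer:                 # exact match
            score += ingredient_anticancer[name] / recipe_weight
        else:                                             # substring fallback
            for key, count in ingredient_anticancer.items():
                if name in key:
                    score += count / recipe_weight
                    break
    return score

print(cancer_score(["garlic", "tea", "flour"], 2.0))  # 5/2 + 8/2 = 6.5
```

"garlic" matches exactly, "tea" matches "green_tea" as a substring, and "flour" contributes nothing.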
Printing, in decreasing order, the recipes with the highest number of cancer-beating molecules.
# ---------------------------- Removing Recipes < 3 Ingredients ----------------------------
threshold = 3
recipe_cancerscore_reduced = [{"score": recipe_cancerscore[recipe_id], "id": recipe_id} for recipe_id, number in new_countingIngredients.items() if number >= threshold]
id_url_reduced = [{"url": id_url[recipe_id], "id": recipe_id} for recipe_id, number in new_countingIngredients.items() if number >= threshold]
recipe_cancerscore_reduced_adapted = {}
id_url_reduced_adapted = {}
for value in recipe_cancerscore_reduced:
    recipe_cancerscore_reduced_adapted[value["id"]] = value["score"]
for value in id_url_reduced:
    id_url_reduced_adapted[value["id"]] = value["url"]
# ---------------------------- Setting Visualization Options ----------------------------
pandas.set_option('display.max_colwidth', 1000)
# ---------------------------- Creating DataFrames ----------------------------
res1 = pandas.DataFrame.from_dict(recipe_cancerscore_reduced_adapted, orient='index', columns=['Anticancer Molecules/Number Ingredients'])
res2 = pandas.DataFrame.from_dict(id_url_reduced_adapted, orient='index', columns=['Recipe URL'])
# ---------------------------- Concatenating DataFrames ----------------------------
df = pandas.concat([res1, res2], axis=1).reindex(res1.index).sort_values(by=['Anticancer Molecules/Number Ingredients'], ascending=False)#[0:30]#.head()
df[0:30]
df.to_csv(r'./recipes_anticancerMolecules.csv', index=True)
The size of each marker is proportional to the number of ingredients the recipe contains.
Markers with similar colors group recipes that share the largest number of common ingredients.
%%time
# Color Recipes According to Number Anticancer Molecules
threshold = 10
sample = list(new_countingIngredients.values())[0:10000]
modules_anticancer = [list(recipe_cancerscore.values())[index] for index, number in enumerate(sample) if number >= threshold] # Coloring recipe nodes based on the number of anticancer molecules present. Needed by modules_anticancer_colored below.
#modules_clustering = [list(recipeModules)[index] for index, number in enumerate(sample) if number >= threshold] # Clustering recipes using 1 out of 4 unsupervised learning methods.
#modules_cuisines = [clf.predict(list(recipe_cancerscore.keys())[index]) for index, number in enumerate(sample) if number >= threshold] #
modules_tea = []
for index, number in enumerate(sample):
    if number >= threshold and list(recipe_cancerscore.keys())[index] in list(id_tea.keys()):
        modules_tea.append("black")
    elif number >= threshold:
        modules_tea.append("orange")
input1 = numpy.array([X_recipes_embedded1[index] for index, number in enumerate(sample) if number >= threshold])
input2 = numpy.array([X_recipes_embedded2[index] for index, number in enumerate(sample) if number >= threshold])
input3 = [list(recipe_cancerscore.keys())[index] for index, number in enumerate(sample) if number >= threshold]
input5 = numpy.array([number for index, number in enumerate(sample) if number >= threshold])
norm = mpl.colors.Normalize(vmin=0, vmax=17)
cmap = cm.hot
m = cm.ScalarMappable(norm=norm, cmap=cmap)
modules_anticancer_colored = []
for number_antic_molecules in modules_anticancer:
    modules_anticancer_colored.append(m.to_rgba(number_antic_molecules))
matplotlib_function(input1, input2, input3, modules_tea, input5 , "Recipes", False)
%autoreload 2
plotly_function(input1, input2, input3, modules_tea, input5, "false", "Recipes")
seaborn_function(input1, input2, input3, modules_anticancer_colored, input5)
# ---------------------------- Creating Synonymous to convert Kaggle&Nature to Recipe1M+ Ingredients ----------------------------
vocabulary = set()
for recipe in id_ingredients_cuisine:
    for ingredient in recipe["ingredients"]:
        vocabulary.add(ingredient.replace(" ", "_"))
synonymous = {}
for ingredient2 in list(vocabulary):
    synonymous[ingredient2] = "new"
aux = 0
for ingredient2 in list(vocabulary):
    # NOTE: the inner `if` line was lost in this cell; an exact-match test is
    # reconstructed here by symmetry with the `elif` (exact matches first,
    # substring matches as a fallback). The collection iterated by the inner
    # loop is presumably the Recipe1M+ ingredient vocabulary.
    for ingredient1 in ingredient:
        if ingredient2 == ingredient1:
            synonymous[ingredient2] = ingredient1
            break
        elif ingredient1 in ingredient2:
            synonymous[ingredient2] = ingredient1
    if synonymous[ingredient2] == "new":
        aux = aux + 1
new_id_ingredients_cuisine = id_ingredients_cuisine
for key1, recipe in enumerate(id_ingredients_cuisine):
    for key2, ingredient in enumerate(recipe["ingredients"]):
        if synonymous[id_ingredients_cuisine[key1]["ingredients"][key2].replace(" ", "_")] == "new":
            new_id_ingredients_cuisine[key1]["ingredients"].remove(id_ingredients_cuisine[key1]["ingredients"][key2])
            continue
        new_id_ingredients_cuisine[key1]["ingredients"][key2] = synonymous[id_ingredients_cuisine[key1]["ingredients"][key2].replace(" ", "_")]
    if len(id_ingredients_cuisine[key1]["ingredients"]) < 2:
        new_id_ingredients_cuisine.remove(id_ingredients_cuisine[key1])
# ---------------------------- Save JSON File with Synonymous ----------------------------
with open('./vocabulary/synonymous.json', 'w') as json_file:
    json.dump(synonymous, json_file)
new_id_ingredients_cuisine_vectorized = {}
for index_recipe, recipe in enumerate(new_id_ingredients_cuisine):
    for ingredient in recipe["ingredients"]:
        new_id_ingredients_cuisine_vectorized[index_recipe] = recipe["ingredients"]
recipe_vector = {}
for recipe_id, ingredients in new_id_ingredients_cuisine_vectorized.items():
    recipe_vector[recipe_id] = []
    for ingredient in ingredients:
        recipe_vector[recipe_id].append(numpy.array(model[ingredient]))
    recipe_vector[recipe_id] = numpy.sum(numpy.array(recipe_vector[recipe_id]), 0) / len(recipe_vector[recipe_id])
X = numpy.array([xi for xi in list(recipe_vector.values())])
y = numpy.array(cuisines)
random_numbers = list(range(0, len(cuisines)))
random.shuffle(random_numbers)
y_mod = [cuisines[index] for index in random_numbers]
X_mod = [X[index] for index in random_numbers]
%%time
if not path.exists('./trained_models/svc_model_cuisine_100.sav'):
    parameters = {'C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000], 'max_iter': [5000], 'dual': [False], 'class_weight': ['balanced']}
    svc = LinearSVC()
    clf = GridSearchCV(svc, parameters, n_jobs=-1)
    clf.fit(X_mod, y_mod)
else:
    clf = pickle.load(open('./trained_models/svc_model_cuisine_100.sav', 'rb'))
pickle.dump(clf, open('./trained_models/svc_model_cuisine_100.sav', 'wb'))
Compute the confusion matrix to evaluate the accuracy of the classification. By definition, a confusion matrix C is such that C[i, j] equals the number of observations known to be in group i and predicted to be in group j.
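A small concrete example of this definition and of the row-wise percentage normalisation (the toy labels below are illustrative):

```python
# Toy confusion matrix: rows are actual classes, columns are predicted classes.
import numpy
from sklearn.metrics import confusion_matrix

y_true = ["asian", "asian", "european", "european", "european"]
y_pred = ["asian", "european", "european", "european", "asian"]

cm = confusion_matrix(y_true, y_pred, labels=["asian", "european"])
print(cm)  # [[1 1]
           #  [1 2]]

# Row-wise normalisation: each row sums to 100%, i.e. the percentage of each
# actual class that was assigned to each predicted class.
cm_perc = cm / cm.sum(axis=1, keepdims=True) * 100
print(cm_perc)
```

This is exactly the normalisation the plotting function performs before rendering the heatmap: percentages make classes of different sizes comparable.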
def cm_analysis(y_true, y_pred, filename, labels, ymap=None, figsize=(20, 15)):
    """
    Generate matrix plot of confusion matrix with pretty annotations.
    The plot image is saved to disk.
    args:
      y_true: true label of the data, with shape (nsamples,)
      y_pred: prediction of the data, with shape (nsamples,)
      filename: filename of figure file to save
      labels: string array, name the order of class labels in the confusion matrix.
              use `clf.classes_` if using scikit-learn models.
              with shape (nclass,).
      ymap: dict: any -> string, length == nclass.
            if not None, map the labels & ys to more understandable strings.
            Caution: original y_true, y_pred and labels must align.
      figsize: the size of the figure plotted.
    """
    if ymap is not None:
        y_pred = [ymap[yi] for yi in y_pred]
        y_true = [ymap[yi] for yi in y_true]
        labels = [ymap[yi] for yi in labels]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_sum = numpy.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = numpy.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                annot[i, j] = '%.1f%%' % p
            elif c == 0:
                annot[i, j] = '0.0%'
            else:
                annot[i, j] = '%.1f%%' % p
    cm = pandas.DataFrame(cm_perc, index=labels, columns=labels)
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = pyplot.subplots(figsize=figsize, dpi=500)
    seaborn.heatmap(cm, annot=annot, fmt='', ax=ax, cmap="Greens", annot_kws={'size': 14})
    pyplot.savefig(filename)
myorder = [6, 8, 10, 2, 7, 0, 4, 3, 9, 1, 5]
y_mod_mod = [y_mod[i] for i in myorder]
prediction = clf.predict(X_mod)
prediction_mod = numpy.array([list(prediction)[i] for i in myorder])
labels_mod = [list(set(y))[i] for i in myorder]
cm_analysis(y_mod, prediction, "confusion_poster", labels_mod)
print("Best estimator: " + str(clf.best_params_))
print("Mean cross-validated score of the best_estimator: " + str(clf.best_score_))
%%time
# ---------------------------- Adding Cuisines to Recipe1M+ Dataset ----------------------------
if not path.exists('./data/Recipe1M+/modified_modified_recipes_data_numeric1000.json'):
    modified_modified_recipes_data = {}
    modified_modified_recipes_data_numeric = {}
    # ------------------------
    cuisine2number = {}
    for position, cuisine in enumerate(list(set(cuisines))):
        cuisine2number[cuisine] = position
    # ------------------------
    ij = 0
    for recipe_id, vector in new_list.items():
        modified_modified_recipes_data[recipe_id] = clf.predict([new_list[recipe_id]])[0]
        modified_modified_recipes_data_numeric[recipe_id] = cuisine2number[modified_modified_recipes_data[recipe_id]]
        ij = ij + 1
else:
    with open('./data/Recipe1M+/modified_modified_recipes_data_numeric.json') as data_file:
        modified_modified_recipes_data_numeric = ast.literal_eval(data_file.read())
# ---------------------------- Generating New Recipe1M+ w/ Cuisines File ----------------------------
file = open('./data/Recipe1M+/modified_modified_recipes_data.json','w')
file.write(str(modified_modified_recipes_data))
# Recipe1M+ recipe nodes colored according to the cuisine they belong to.
matplotlib_function(input1, input2, input3, list(modified_modified_recipes_data_numeric.values())[0:len(input1)], input5, "Recipes", False)